For the following questions, I am going to bring a number of Stock and ETF files into one single data frame, and inspect for further analyses in the future.
Especially, for each Stock and ETF data, I will examine whether there are anomalous values and, if so, why those anomalies happened in the data.
In addition, I will conduct feature engineering for later analyses.
library(tidyverse)
library(RcppRoll)
library(plotly)
stock_list <-list.files("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/Stocks")
etf_list <-list.files("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/ETFs")
length(stock_list);length(etf_list)
## [1] 7196
## [1] 1345
There are total \(7195\) files in Stock file and \(1345\) files in ETFs file.
However, I was able to find some blank files (with no data at all) in Stock ones.
As such, I applied additional file.info(file)$size > 0 when making a code to bring Stock files.
# Make one common Stock File
setwd("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/Stocks")
for (file in stock_list){
if (!exists("stock_data")){
stock_data <- read_csv(file)
stock_data$Symbol <-gsub('(.*).us.txt','\\1',file)
write_csv(stock_data,"stock_data.csv", col_names = TRUE)
}
if (exists("stock_data") & file.info(file)$size > 0){
temp_data <-read_csv(file)
temp_data$Symbol <- gsub('(.*).us.txt','\\1',file)
write_csv(temp_data, "stock_data.csv", append = TRUE)
}
}
stock_DF<-read_csv("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/Stocks/stock_data.csv")
dim(stock_DF)
## [1] 27640460 8
# Make one common ETF File
setwd("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/ETFs")
for (file in etf_list){
if (!exists("etf_data")){
etf_data <- read_csv(file)
etf_data$Symbol <-gsub('(.*).us.txt','\\1',file)
write_csv(etf_data,"etf_data.csv", col_names = TRUE)
}
if (exists("etf_data")){
temp_data <-read_csv(file)
temp_data$Symbol <- gsub('(.*).us.txt','\\1',file)
write_csv(temp_data, "etf_data.csv", append = TRUE)
}
}
etf_DF<-read_csv("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/ETFs/etf_data.csv")
dim(etf_DF)
## [1] 3112903 8
summary(stock_DF)
## Date Open High
## Min. :1962-01-02 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:2007-12-03 1st Qu.:8.000e+00 1st Qu.:8.000e+00
## Median :2012-02-17 Median :1.600e+01 Median :1.600e+01
## Mean :2010-06-20 Mean :3.268e+04 Mean :3.356e+04
## 3rd Qu.:2015-05-20 3rd Qu.:2.900e+01 3rd Qu.:2.900e+01
## Max. :2017-11-10 Max. :1.424e+09 Max. :1.442e+09
## NA's :4246 NA's :4854
## Low Close Volume OpenInt
## Min. : -1 Min. :0.000e+00 Min. :0.000e+00 Min. :0
## 1st Qu.: 8 1st Qu.:8.000e+00 1st Qu.:3.239e+04 1st Qu.:0
## Median : 15 Median :1.600e+01 Median :1.900e+05 Median :0
## Mean : 31576 Mean :3.253e+04 Mean :1.587e+06 Mean :0
## 3rd Qu.: 28 3rd Qu.:2.900e+01 3rd Qu.:8.788e+05 3rd Qu.:0
## Max. :1362117844 Max. :1.438e+09 Max. :2.070e+09 Max. :0
## NA's :4694 NA's :8114 NA's :4
## Symbol
## Length:27640460
## Class :character
## Mode :character
##
##
##
##
I found several NAs over four columns. As such, I decided to examine and remove them, unless they are critical.
# Five examples with NA in the `High` column
stock_DF %>% filter(is.na(High)) %>% head()
## # A tibble: 6 x 8
## Date Open High Low Close Volume OpenInt Symbol
## <date> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr>
## 1 2012-04-05 453 NA 441 NA 8391 0 aezs
## 2 2012-04-09 498 NA 462 480 7694 0 aezs
## 3 2012-04-13 414 NA 396 402 2848 0 aezs
## 4 2012-04-18 378 NA NA NA 2446 0 aezs
## 5 2012-04-19 NA NA NA 375 1803 0 aezs
## 6 2012-04-24 NA NA NA NA 1670 0 aezs
So, I will remove those NAs and Low==-1 that we have already detected during the first class.
stock <- stock_DF[complete.cases(stock_DF),]
stock <-stock[stock$Low!=-1,]
stock$Symbol <- toupper(stock$Symbol) #Just because UPPER Symbol is more familiar
summary(stock)
## Date Open High
## Min. :1962-01-02 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:2007-11-30 1st Qu.:8.000e+00 1st Qu.:8.000e+00
## Median :2012-02-16 Median :1.600e+01 Median :1.600e+01
## Mean :2010-06-20 Mean :3.268e+04 Mean :3.356e+04
## 3rd Qu.:2015-05-20 3rd Qu.:2.900e+01 3rd Qu.:2.900e+01
## Max. :2017-11-10 Max. :1.424e+09 Max. :1.442e+09
## Low Close Volume OpenInt
## Min. :0.000e+00 Min. :0.000e+00 Min. :0.000e+00 Min. :0
## 1st Qu.:8.000e+00 1st Qu.:8.000e+00 1st Qu.:3.240e+04 1st Qu.:0
## Median :1.500e+01 Median :1.600e+01 Median :1.901e+05 Median :0
## Mean :3.158e+04 Mean :3.252e+04 Mean :1.587e+06 Mean :0
## 3rd Qu.:2.800e+01 3rd Qu.:2.900e+01 3rd Qu.:8.792e+05 3rd Qu.:0
## Max. :1.362e+09 Max. :1.438e+09 Max. :2.070e+09 Max. :0
## Symbol
## Length:27630232
## Class :character
## Mode :character
##
##
##
Now, I am going to explore the data graphically.
stock %>%
filter(Symbol=='AAPL'|Symbol=="IBM"|Symbol=="GOOGL")%>%
filter(Date>"2010-01-01")%>%
ggplot(aes(x=Date))+
geom_line(aes(y=High), colour='blue')+
geom_line(aes(y=Low), colour='grey')+
ylab(label="High and Low")+
facet_wrap(~Symbol, scales="free_y")+
ggtitle("Three Companies' Stocks Trends over time (High&Low)")
Firstly, I looked at the three major tech companies’ stock prices (Daily High & Low) with adjusted scales.
I was able to find that when one’s stock price shows a rising tendency, it tends to have smaller gap between High and Low prices.
stock%>%
ggplot(aes(x=Volume))+geom_histogram()+
ggtitle("Stocks Volume Histogram")
stock %>%
ggplot(aes(x=log(Volume)))+geom_histogram()+
ggtitle("Stocks log(Volume) Histogram")
As we examined during the class, since the Volumne variable is highly right skewed, we would better to use log(Volume).
I decided to do feature engineering to see more general trends from stock data.
stock <- stock %>%
group_by(Symbol) %>%
mutate(Open_Change=Open-lag(Open),
High_Change=High-lag(High),
Low_Change=Low-lag(Low),
Close_Change=Close-lag(Close),
Volume_Change=Volume-lag(Volume)) %>%
mutate(Open_PctChange = (Open/lag(Open)-1) * 100,
High_PctChange= (High/lag(High)-1) * 100,
Low_PctChange= (Low/lag(Low)-1) * 100,
Close_PctChange= (Close/lag(Close)-1) * 100,
Volume_PctChange= (Volume/lag(Volume)-1) * 100) %>%
mutate(Open_Mean30=roll_mean(Open, 30, na.rm=TRUE, align="right", fill = NA),
High_Mean30=roll_mean(High, 30, na.rm=TRUE, align="right", fill = NA),
Low_Mean30=roll_mean(Low, 30, na.rm=TRUE, align="right", fill = NA),
Close_Mean30=roll_mean(Close, 30, na.rm=TRUE, align="right", fill = NA),
Volume_Mean30=roll_mean(Volume, 30, na.rm=TRUE, align="right", fill = NA)) %>%
mutate(Open_Ch_Mean30=roll_mean(Open_Change, 30, na.rm=TRUE, align="right", fill = NA),
High_Ch_Mean30=roll_mean(High_Change, 30, na.rm=TRUE, align="right", fill = NA),
Low_Ch_Mean30=roll_mean(Low_Change, 30, na.rm=TRUE, align="right", fill = NA),
Close_Ch_Mean30=roll_mean(Close_Change, 30, na.rm=TRUE, align="right", fill = NA),
Volumne_Ch_Mean30=roll_mean(Volume_Change, 30, na.rm=TRUE, align="right", fill = NA)) %>%
mutate(Open_PctCh_Mean30=roll_mean(Open_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
High_PctCh_Mean30=roll_mean(High_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
Low_PctCh_Mean30=roll_mean(Low_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
Close_PctCh_Mean30=roll_mean(Close_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
Volumne_PctCh_Mean30=roll_mean(Volume_PctChange, 30, na.rm=TRUE, align="right", fill = NA)) %>%
ungroup()
tail(stock)
## # A tibble: 6 x 33
## Date Open High Low Close Volume OpenInt Symbol Open_Change
## <date> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr> <dbl>
## 1 2017-11-03 9.83 10.5 9.83 10.3 531495 0 ZYNE 0.240
## 2 2017-11-06 10.4 11.5 10.4 11.2 977948 0 ZYNE 0.590
## 3 2017-11-07 11.3 11.4 10.7 10.8 451210 0 ZYNE 0.880
## 4 2017-11-08 10.7 11.1 10.4 10.9 336449 0 ZYNE -0.600
## 5 2017-11-09 11.0 11.9 11.0 11.6 463067 0 ZYNE 0.300
## 6 2017-11-10 11.7 13.2 11.3 12.5 885587 0 ZYNE 0.680
## # ... with 24 more variables: High_Change <dbl>, Low_Change <dbl>,
## # Close_Change <dbl>, Volume_Change <int>, Open_PctChange <dbl>,
## # High_PctChange <dbl>, Low_PctChange <dbl>, Close_PctChange <dbl>,
## # Volume_PctChange <dbl>, Open_Mean30 <dbl>, High_Mean30 <dbl>,
## # Low_Mean30 <dbl>, Close_Mean30 <dbl>, Volume_Mean30 <dbl>,
## # Open_Ch_Mean30 <dbl>, High_Ch_Mean30 <dbl>, Low_Ch_Mean30 <dbl>,
## # Close_Ch_Mean30 <dbl>, Volumne_Ch_Mean30 <dbl>,
## # Open_PctCh_Mean30 <dbl>, High_PctCh_Mean30 <dbl>,
## # Low_PctCh_Mean30 <dbl>, Close_PctCh_Mean30 <dbl>,
## # Volumne_PctCh_Mean30 <dbl>
stock %>%
filter(Symbol=='AAPL'|Symbol=='GOOGL'|Symbol=="IBM")%>%
filter(Date>"2010-01-01")%>%
ggplot(aes(x=Date))+
geom_line(aes(y=High_Mean30), color="blue")+
geom_line(aes(y=Low_Mean30), color="grey")+
ylab(label="High_Mean30 and Low_Mean30")+
facet_wrap(~Symbol, scales="free_y")+
ggtitle("Three Companies' Mean High & Low Prices every 30 Days")
Compare to the previous Open price plot(“Three Companies’ Stocks Trends over time (High&Low)),
now the plot has way less granularity with maintaining the general trends.
As such, I am going to keep using Open_Mean30 instead of Open to find interesting points from the data.
stock %>%
filter(Symbol=='AAPL'|Symbol=='GOOGL'|Symbol=="IBM")%>%
filter(Date>"2016-01-01")%>%
ggplot(aes(x=Date, y=Open_Ch_Mean30, color=Symbol))+
geom_line()+
ggtitle("Three Companies' Avg. Change in Open Prices every 30 Days")
stock %>%
filter(Symbol=='AAPL'|Symbol=='GOOGL'|Symbol=="IBM")%>%
filter(Date>"2016-01-01")%>%
ggplot(aes(x=Date, y=Open_PctCh_Mean30, color=Symbol))+
geom_line()+
ggtitle("Three Companies' Avg. % Change in Open Prices every 30 Days")
In addition to making XXXX_Mean30, I also added average change every 30 days, such as Open_Ch_Mean30, and average % change every 30 days, such as Open_PctCh_Mean30.
By doing so, I wanted to see the general fluctuation changes of stock prices.
From the first plot, it looks like GOOGL‘s open stock price fluctuates the most. While this is true, though, it is mainly because the company has the biggest stock quantity.
As we can see from the second plot, which is three tech companies’ average % change in stock prices, the companies tend to show similar trends, except that IBM has a very different tendency during the first and second quarter of 2017. We may want to examine what happend during that time.
stock %>%
filter(Date>="2008-05-01" & Date <="2009-12-31") %>%
filter(Symbol=='XL'|Symbol=="GNW") %>%
ggplot(aes(x=Date, y=Open_Mean30, color=Symbol))+
geom_line()+
ggtitle("Stocks that Crashed in 2008")+
geom_vline(xintercept = as.Date(c("2008-09-01","2008-10-31")))
stock %>%
filter(Date>="2008-05-01" & Date <="2009-12-31") %>%
filter(Symbol=='AMGN'|Symbol=="WMT") %>%
ggplot(aes(x=Date, y=Open_Mean30, color=Symbol))+
geom_line()+
ggtitle("Stocks that Thrived in 2008")+
geom_vline(xintercept = as.Date(c("2008-09-01","2008-09-30")))
The first plot depicts two stocks that performed the worst during the recession, and the second one describes two stocks that relatively thrived during the recession in 2008.
From further analyses, we might be able to analyze the reasons why some stocks especially crashed, while the others survived well.
summary(etf_DF)
## Date Open High
## Min. :1999-03-10 Min. : 0 Min. : 0
## 1st Qu.:2010-12-17 1st Qu.: 23 1st Qu.: 23
## Median :2013-09-24 Median : 35 Median : 35
## Mean :2013-03-25 Mean : 1860 Mean : 1954
## 3rd Qu.:2015-11-10 3rd Qu.: 55 3rd Qu.: 55
## Max. :2017-11-10 Max. :34116000 Max. :37152000
## NA's :745 NA's :752
## Low Close Volume OpenInt
## Min. : 0 Min. : 0 Min. :0.000e+00 Min. :0
## 1st Qu.: 22 1st Qu.: 23 1st Qu.:9.002e+03 1st Qu.:0
## Median : 34 Median : 35 Median :4.489e+04 Median :0
## Mean : 1782 Mean : 1854 Mean :1.427e+06 Mean :0
## 3rd Qu.: 54 3rd Qu.: 55 3rd Qu.:2.372e+05 3rd Qu.:0
## Max. :33696000 Max. :35304000 Max. :1.519e+09 Max. :0
## NA's :734 NA's :445
## Symbol
## Length:3112903
## Class :character
## Mode :character
##
##
##
##
etfs <- etf_DF[complete.cases(etf_DF),]
etfs$Symbol <- toupper(etfs$Symbol)
summary(etfs) #After remove NAs and adjust Symbols
## Date Open High
## Min. :1999-03-10 Min. : 0 Min. : 0
## 1st Qu.:2010-12-17 1st Qu.: 23 1st Qu.: 23
## Median :2013-09-24 Median : 35 Median : 35
## Mean :2013-03-25 Mean : 1860 Mean : 1954
## 3rd Qu.:2015-11-10 3rd Qu.: 55 3rd Qu.: 55
## Max. :2017-11-10 Max. :34116000 Max. :37152000
## Low Close Volume OpenInt
## Min. : 0 Min. : 0 Min. :0.000e+00 Min. :0
## 1st Qu.: 22 1st Qu.: 23 1st Qu.:9.000e+03 1st Qu.:0
## Median : 34 Median : 35 Median :4.485e+04 Median :0
## Mean : 1782 Mean : 1854 Mean :1.426e+06 Mean :0
## 3rd Qu.: 54 3rd Qu.: 55 3rd Qu.:2.369e+05 3rd Qu.:0
## Max. :33696000 Max. :35304000 Max. :1.519e+09 Max. :0
## Symbol
## Length:3111965
## Class :character
## Mode :character
##
##
##
etfs %>%
filter(Open>10000000)
## # A tibble: 218 x 8
## Date Open High Low Close Volume OpenInt Symbol
## <date> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr>
## 1 2010-12-03 22090000 22090000 19530000 19695000 0 0 TVIX
## 2 2010-12-06 19942500 20000000 18655000 18922500 0 0 TVIX
## 3 2010-12-07 18750000 18750000 17525000 18532500 0 0 TVIX
## 4 2010-12-08 18430000 18975000 17500000 17730000 0 0 TVIX
## 5 2010-12-09 16690000 17370000 16690000 16872500 0 0 TVIX
## 6 2010-12-10 16655000 16725000 16362500 16587500 0 0 TVIX
## 7 2010-12-13 16300000 17250000 16300000 17187500 0 0 TVIX
## 8 2010-12-14 17095000 17600000 16982500 17375000 0 0 TVIX
## 9 2010-12-15 17300000 18200000 16962500 18200000 0 0 TVIX
## 10 2010-12-16 18025000 18287500 17447500 17895000 0 0 TVIX
## # ... with 208 more rows
UVXY and TVIX showed extremely high OHLC prices during the certain period.
UVXY has 8 splits in its history (https://www.splithistory.com/uvxy/), with the first starting on March 08, 2012.
TVIX has 5 splits according to split.history.com(https://www.splithistory.com/?symbol=tvix), with the first on December 21, 2012.
We can also see these two ETFs regarding splits via graphs.
a <- list(text = "First Stock Split",
x = '2012-03-08',
y = 1.02,
xref = 'x',
yref = 'paper',
xanchor = 'left',
showarrow = FALSE
)
# use shapes to create a line
l <- list(type = line,
x0 = '2012-03-08',
x1 = '2012-03-08',
y0 = 0,
y1 = 1,
xref = 'x',
yref = 'paper',
line = list(color = 'black',
width = 0.5)
)
p <- etfs %>%
filter(Symbol=="UVXY")%>%
plot_ly(x = ~Date, type="ohlc",
open = ~Open, close = ~Close,
high = ~High, low = ~Low) %>%
layout(title = "UVXY With the First Stock Split",
annotations = a,
shapes = l,
xaxis = list(rangeslider = list(visible = F)))
p
a <- list(text = "First Stock Split",
x = '2012-12-21',
y = 1.02,
xref = 'x',
yref = 'paper',
xanchor = 'left',
showarrow = FALSE
)
# use shapes to create a line
l <- list(type = line,
x0 = '2012-12-21',
x1 = '2012-12-21',
y0 = 0,
y1 = 1,
xref = 'x',
yref = 'paper',
line = list(color = 'black',
width = 0.5)
)
p <- etfs %>%
filter(Symbol=="TVIX")%>%
plot_ly(x = ~Date, type="ohlc",
open = ~Open, close = ~Close,
high = ~High, low = ~Low) %>%
layout(title = "TVIX With the First Stock Split",
annotations = a,
shapes = l,
xaxis = list(rangeslider = list(visible = F)))
p
etfs %>%
filter(Volume>1e+08)
## # A tibble: 6,964 x 8
## Date Open High Low Close Volume OpenInt Symbol
## <date> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr>
## 1 2008-11-21 12.0 13.1 11.2 13.0 137101117 0 DDM
## 2 2008-11-21 66.2 69.3 63.9 69.1 106636496 0 DIA
## 3 2007-02-27 33.3 34.5 30.9 31.8 101592969 0 EEM
## 4 2007-07-26 40.4 40.4 38.2 39.2 112175189 0 EEM
## 5 2007-08-01 38.6 39.2 38.1 38.9 120207365 0 EEM
## 6 2007-08-16 34.8 35.3 32.8 34.9 158324089 0 EEM
## 7 2007-08-17 36.2 37.4 34.9 36.0 112881648 0 EEM
## 8 2007-11-28 43.8 45.7 43.8 45.7 114098336 0 EEM
## 9 2008-01-08 43.4 43.9 42.5 42.6 100347472 0 EEM
## 10 2008-01-15 43.0 43.2 41.8 41.9 102457098 0 EEM
## # ... with 6,954 more rows
All data do not seem to be technically abnormal.
Now, I am going to explore the data graphically.
etfs %>%
filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
filter(Date>"2005-01-01")%>%
ggplot(aes(x=Date))+
geom_line(aes(y=High), color="blue")+
geom_line(aes(y=Low), color="grey")+
facet_wrap(~Symbol, scales="free_y")+
ylab(label="High and Low")+
ggtitle("Three ETFs High & Low Prices")
This time, I looked at three famous ETF’s High and Low prices.
Generally, ETFs’ prices had a slightly larger gap between High and Low than Stocks’ ones.
Interestingly, I was able to find a strange looking High price from XLF, and decide to examine it.
etfs %>%
filter(Symbol=="XLF", High>40)
## # A tibble: 1 x 8
## Date Open High Low Close Volume OpenInt Symbol
## <date> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr>
## 1 2011-06-16 10.7 52.7 10.6 10.7 186227724 0 XLF
There was only one day during the entire period that XLF’s High price was over 50. I seached the High price online (From Yahoo Finance), and was not able to find any price over 50.
As such, I assume that this was an error, and need to be adjusted.
etfs %>%
ggplot(aes(x=Volume))+geom_histogram()+
ggtitle("ETFs Volume Histogram")
etfs %>%
ggplot(aes(x=log(Volume)))+geom_histogram()+
ggtitle("ETFs log(Volume) Histogram")
Like Stocks data, since the Volume is highly right skewed, it is better to see log(Volume).
I decided to do feature engineering to see more general trends from ETFs data.
etfs <- etfs %>%
group_by(Symbol) %>%
mutate(Open_Change=Open-lag(Open),
High_Change=High-lag(High),
Low_Change=Low-lag(Low),
Close_Change=Close-lag(Close),
Volume_Change=Volume-lag(Volume)) %>%
mutate(Open_PctChange = (Open/lag(Open)-1) * 100,
High_PctChange= (High/lag(High)-1) * 100,
Low_PctChange= (Low/lag(Low)-1) * 100,
Close_PctChange= (Close/lag(Close)-1) * 100,
Volume_PctChange= (Volume/lag(Volume)-1) * 100) %>%
mutate(Open_Mean30=roll_mean(Open, 30, na.rm=TRUE, align="right", fill = NA),
High_Mean30=roll_mean(High, 30, na.rm=TRUE, align="right", fill = NA),
Low_Mean30=roll_mean(Low, 30, na.rm=TRUE, align="right", fill = NA),
Close_Mean30=roll_mean(Close, 30, na.rm=TRUE, align="right", fill = NA),
Volume_Mean30=roll_mean(Volume, 30, na.rm=TRUE, align="right", fill = NA)) %>%
mutate(Open_Ch_Mean30=roll_mean(Open_Change, 30, na.rm=TRUE, align="right", fill = NA),
High_Ch_Mean30=roll_mean(High_Change, 30, na.rm=TRUE, align="right", fill = NA),
Low_Ch_Mean30=roll_mean(Low_Change, 30, na.rm=TRUE, align="right", fill = NA),
Close_Ch_Mean30=roll_mean(Close_Change, 30, na.rm=TRUE, align="right", fill = NA),
Volumne_Ch_Mean30=roll_mean(Volume_Change, 30, na.rm=TRUE, align="right", fill = NA)) %>%
mutate(Open_PctCh_Mean30=roll_mean(Open_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
High_PctCh_Mean30=roll_mean(High_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
Low_PctCh_Mean30=roll_mean(Low_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
Close_PctCh_Mean30=roll_mean(Close_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
Volumne_PctCh_Mean30=roll_mean(Volume_PctChange, 30, na.rm=TRUE, align="right", fill = NA)) %>%
ungroup()
tail(etfs)
## # A tibble: 6 x 33
## Date Open High Low Close Volume OpenInt Symbol Open_Change
## <date> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr> <dbl>
## 1 2017-11-03 30.7 32.1 30.6 31.8 37680 0 ZSL 0
## 2 2017-11-06 31.8 31.8 30.4 30.4 64802 0 ZSL 1.07
## 3 2017-11-07 31.4 31.6 31.4 31.4 24947 0 ZSL -0.400
## 4 2017-11-08 31.0 31.3 30.4 31.3 48771 0 ZSL -0.410
## 5 2017-11-09 31.5 31.5 31.2 31.3 34983 0 ZSL 0.520
## 6 2017-11-10 31.2 32.1 31.0 31.7 34123 0 ZSL -0.320
## # ... with 24 more variables: High_Change <dbl>, Low_Change <dbl>,
## # Close_Change <dbl>, Volume_Change <int>, Open_PctChange <dbl>,
## # High_PctChange <dbl>, Low_PctChange <dbl>, Close_PctChange <dbl>,
## # Volume_PctChange <dbl>, Open_Mean30 <dbl>, High_Mean30 <dbl>,
## # Low_Mean30 <dbl>, Close_Mean30 <dbl>, Volume_Mean30 <dbl>,
## # Open_Ch_Mean30 <dbl>, High_Ch_Mean30 <dbl>, Low_Ch_Mean30 <dbl>,
## # Close_Ch_Mean30 <dbl>, Volumne_Ch_Mean30 <dbl>,
## # Open_PctCh_Mean30 <dbl>, High_PctCh_Mean30 <dbl>,
## # Low_PctCh_Mean30 <dbl>, Close_PctCh_Mean30 <dbl>,
## # Volumne_PctCh_Mean30 <dbl>
etfs %>%
filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
filter(Date>"2005-01-01")%>%
ggplot(aes(x=Date))+
geom_line(aes(y=High_Mean30), color="blue")+
geom_line(aes(y=Low_Mean30), color="grey")+
facet_wrap(~Symbol, scales="free_y")+
ylab(label="High_Mean30 and Low_Mean30")+
ggtitle("Three ETFs' Mean High & Low Prices every 30 Days")
Just like the Stocks case, now the plot has less granularity.
etfs %>%
filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
filter(Date>"2005-01-01")%>%
ggplot(aes(x=Date, y=Open_Ch_Mean30, color=Symbol))+
geom_line()+
ggtitle("Three ETFs' Avg. Change in Open Prices every 30 Days")
etfs %>%
filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
filter(Date>"2005-01-01")%>%
ggplot(aes(x=Date, y=Open_PctCh_Mean30, color=Symbol))+
geom_line()+
ggtitle("Three ETFs' Avg. % Change in Open Prices every 30 Days")
As we can see from the above two plots, those three ETFs’ stocks have a very similar tendency over time.
Although, there is an ETF that fluctuates more than the others(XLF), they mostly move along together.
XXX_Change(Open_Change, Close_Change, and so on) and XXX_PctChange will help me understand how a stock change daily.XXX_Mean30 will help me see the more general trend of a stock, by examining change in every 30 days.XXX_Ch_Mean30 and XXX_PctCh_Mean30 will help me understand which stocks fluctuate severely than others.